Skip to content

Variant indexer performance improvements#1551

Merged
oblodgett merged 1 commit intostagefrom
variant_indexer_improvements
Apr 3, 2026
Merged

Variant indexer performance improvements#1551
oblodgett merged 1 commit intostagefrom
variant_indexer_improvements

Conversation

@oblodgett
Copy link
Copy Markdown
Member

Summary

  • sortSmartAlpha/sortLowercase split: Replace single sort() method with explicit sortSmartAlpha() and sortLowercase() on FieldBuilder. Switch 12 of 17 variant index sort fields to sortLowercase where values are pure text — eliminates ~7 billion regex evaluations during indexing (600M+ docs). Hot thread analysis showed the PatternReplaceCharFilter regex consuming 95% of ES write thread CPU.
  • Crash detection: Add catch-all Exception handler in RoutedBulkIndexer with stack trace logging + System.exit(-1). Investigation found 7 of 32 MGD BP threads dying silently during indexing with no diagnostics.

Fields switched to sortLowercase (variant index only)

Field Reason
alterationTypeSortOrder Integer — no text sort needed
consequence.vepImpact.name Pure text: HIGH, LOW, MODERATE, MODIFIER
consequence.vepConsequences.name Pure text (3_prime_UTR numbers don't affect sort)
consequence.variantTranscript.transcriptType.name Pure text: protein_coding, ncRNA, etc.
consequence.siftPrediction.name Pure text: tolerated, deleterious
consequence.polyphenPrediction.name Pure text: benign, probably_damaging
variant.variantType.name / variants.variantType.name Pure text: SNP, deletion, insertion
variants.*.vepConsequences.name Same as consequence.vepConsequences
associatedPhenotype, diseaseTerms.name, alleleSynonyms.displayText Empty/unused in variant index

Fields kept on sortSmartAlpha

Field Reason
symbol HGVS notation: NC_000078.7:g.34620827C>T
*.hgvs HGVS notation
consequence.variantTranscript.name Transcript IDs: ENSEMBL:ENSMUST00000094140
geneSymbol.displayText Gene names with numbers: Hdac9, BRCA1

Test plan

  • Verify variant indexer compiles and runs
  • Confirm sort behavior unchanged on variant pages (sort by vepImpact, variantType, symbol)
  • Monitor ES write thread CPU during next indexer run — expect significant reduction
  • Watch for RuntimeException crash output from the new catch-all handler

- Add sortSmartAlpha() and sortLowercase() to FieldBuilder, replacing
  the single sort() method. sortSmartAlpha uses the regex-based
  smart_alpha_sort normalizer (zero-pad numbers for natural sort).
  sortLowercase uses the simple lowercase normalizer (no regex).

- Switch 12 of 17 variant index sort fields to sortLowercase where
  field values are pure text (vepImpact, siftPrediction, polyphenPrediction,
  transcriptType, variantType, vepConsequences, alterationTypeSortOrder,
  associatedPhenotype, diseaseTerms, alleleSynonyms). Only 5 fields with
  meaningful numbers (symbol, hgvs, variantTranscript.name, geneSymbol)
  remain on sortSmartAlpha.

  This eliminates ~7 billion regex evaluations during variant indexing
  (600M+ docs). The smart_alpha_sort regex (PatternReplaceCharFilter)
  was consuming 95% of ES write thread CPU per hot thread analysis.

- Add catch-all Exception handler in RoutedBulkIndexer to log uncaught
  RuntimeExceptions that silently kill BP threads, then System.exit(-1)
  to fail fast. Investigation found 7 of 32 MGD BP threads dying silently
  during indexing with no stack trace.

- All site_index sort() calls renamed to sortSmartAlpha() (no behavior change).
@oblodgett oblodgett requested a review from a team as a code owner April 3, 2026 17:24
@claude
Copy link
Copy Markdown

claude bot commented Apr 3, 2026

Code Review

Overall: Changes look correct. The sort strategy split is well-reasoned and the field-level choices in VariantMapping.java match the rationale in the PR description.

One issue to consider

System.exit(-1) in RoutedBulkIndexer.run() (line 128) — Two minor points:

  1. e.printStackTrace() is redundantlog.error(label + " Unexpected ...", e) already logs the full stack trace. The e.printStackTrace() just duplicates it to stderr.

  2. System.exit(-1) bypasses the finally blockSystem.exit() never returns, so the finally at line 130 won't execute. This means: the ES client won't be closed gracefully (client.close() at line 137), and if gatherStats is true, ph.finishProcess() / logStats() won't run. For a crashing process this is acceptable, but worth being aware of — particularly if shutdown hooks in the ES client expect a clean close. A Runtime.getRuntime().halt(-1) would be more explicit about the intent if you truly want immediate termination, or moving System.exit(-1) after the finally block (using a flag) would preserve the cleanup.

Mapping changes look correct

  • The sortSmartAlpha / sortLowercase builder methods both emit a sub-field named "sort" — they are mutually exclusive by design. No field in this PR uses both, so no conflict exists. A guard/assertion preventing both flags from being set would be a nice future safety net, but isn't necessary for this PR.
  • The field-by-field sort strategy choices in VariantMapping.java (lowercase for pure-text enum values, smart_alpha for HGVS/transcript IDs/gene symbols) are well-justified.
  • Mapping.java fields uniformly move to sortSmartAlpha, preserving existing behavior.

LGTM with the minor System.exit notes above.

@oblodgett oblodgett enabled auto-merge April 3, 2026 17:27
@oblodgett oblodgett merged commit 05b717e into stage Apr 3, 2026
5 checks passed
@oblodgett oblodgett deleted the variant_indexer_improvements branch April 3, 2026 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants